Improving performance of Text Categorization: Using Multi Feature Coselection Clustering Technique and Lsquare Machine Learning
Abstract
Text categorization remains one of the most researched NLP problems because of the ever-increasing volume of electronic documents and digital libraries. In this paper, we present a novel text categorization method that combines Multitype Features Coselection for Clustering (MFCC) with a learning logic technique called Lsquare for constructing text classifiers. The high dimensionality of document text hinders categorization, and feature clustering has proven to be an ideal alternative to feature selection for reducing that dimensionality. We therefore use MFCC to generate an efficient representation of documents and apply Lsquare to train text classifiers. The method was extensively tested and evaluated: it achieves classification accuracy and F1 scores higher than or comparable to those of SVM, and MFCC also improves clustering performance.
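To make the idea of reducing dimensionality by feature clustering concrete, here is a minimal Python sketch. It is not MFCC and not Lsquare: the grouping rule (assign each term to the class in which it occurs most often) and the names `cluster_terms_by_class` and `project` are assumptions chosen only for illustration, and the toy documents are invented.

```python
from collections import defaultdict

def cluster_terms_by_class(docs, labels):
    """Crude feature clustering (an assumption, not MFCC):
    put each term in the cluster of the class where it occurs most often."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
    # Iterating classes in sorted order makes tie-breaking deterministic.
    return {tok: max(sorted(by_class), key=by_class.get)
            for tok, by_class in counts.items()}

def project(tokens, term_to_cluster, clusters):
    """Represent a document with one count per term cluster
    instead of one count per distinct term."""
    index = {c: i for i, c in enumerate(clusters)}
    vec = [0] * len(clusters)
    for tok in tokens:
        cluster = term_to_cluster.get(tok)
        if cluster is not None:  # terms unseen in training are skipped
            vec[index[cluster]] += 1
    return vec

# Toy training data (hypothetical).
docs = [["goal", "match", "team"], ["vote", "election", "team"],
        ["match", "score"], ["vote", "party"]]
labels = ["sport", "politics", "sport", "politics"]

term_to_cluster = cluster_terms_by_class(docs, labels)
clusters = sorted(set(labels))
reduced = project(["goal", "team", "vote", "unknown"], term_to_cluster, clusters)
```

Here seven distinct terms collapse into two cluster features, so every document becomes a two-element vector. A classifier would then be trained on these reduced vectors rather than on the full term space.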
Similar resources
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in the amount of information, tools and methods are needed to search, filter, and manage resources. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...
New Approach for Data classification using Multi view graph learning Technique
Text classification is gaining importance because of the accessibility of a large number of electronic documents from a variety of sources. Text classification (also called text categorization) is the task of assigning predefined categories to documents. It is the method of finding interesting regularities in large textual, where interesting means non trivial, hidden, previously unkno...
Improving Feature Selection Techniques for Machine Learning
As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant or noise features to reduce the dimensionality of feature space. It improves efficiency, accuracy and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applica...
Learning with Unlabeled Data for Text Categorization Using a Bootstrapping and a Feature Projection Technique
A wide range of supervised learning algorithms has been applied to Text Categorization. However, the supervised learning approaches have some problems. One of them is that they require a large, often prohibitive, number of labeled training documents for accurate learning. Generally, acquiring class labels for training data is costly, while gathering a large quantity of unlabeled data is cheap. ...
Authorship Attribution Based on Feature Set Subspacing Ensembles
Authorship attribution can assist the criminal investigation procedure as well as cybercrime analysis. This task can be viewed as a single-label multi-class text categorization problem. Given that the style of a text can be represented as mere word frequencies selected in a language-independent method, suitable machine learning techniques able to deal with high dimensional feature spaces and sp...